Principal Components Analysis Using Averaged Eigenvector Centrality Over Secondary Structures¶


In [ ]:
# Import necessary packages to perform the analysis demonstrated below.

import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from matplotlib_venn import venn2,venn2_circles
from matplotlib import pyplot as plt
import csv
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Preliminary inspection of the data shows that there are differences in how eigenvector centrality is distributed across the primary structures of LRH-1 (See multiplot below).

Although this is readily observable on inspection, what is needed is a rigorous approach that demonstartes that these differences are categorical.

The principal Components analysis demonstrates a clear grouping.

In [29]:
fig, axs = plt.subplots(5,4,figsize=(30,20),sharey=True, sharex=True)
pdbList=[['1yok','1yuc','1zdu','3plz'],['3tx7','4dor','4dos','4oni'],['4ple','4rwv','6vc2','NA'],
         ['4pld','5l11','5syz','5unj'],['6oqx','6oqy','6or1','NA']]

path = "/Users/davidfoutch/Desktop/David/"

for j in range(5):
    for k in range(4):
        X = []
        Y = []
        if pdbList[j][k] != 'NA':
            i=pdbList[j][k]
            pathToFile=path+i+"/eigen_"+i+"_plt"
            with open(pathToFile, 'r') as file:
                plotting = csv.reader(file, delimiter='\t')
                for ROWS in plotting:
                    X.append(int(ROWS[0]))
                    Y.append(float(ROWS[1]))

                axs[j,k].plot(X, Y, color="black", label=i)
                axs[j,k].legend(fontsize="12")


# plt.xticks(np.arange(210, 485, 25))

# # plt.title('')
fig.text(0.075, 0.5, 'Eigenvector Centrality', va='center', rotation='vertical',fontsize=18)
fig.text(0.5, 0.055, 'Residue Position', ha='center',fontsize=18)

# plt.savefig('nr5a1_plts.png')
Out[29]:
Text(0.5, 0.055, 'Residue Position')
No description has been provided for this image

Eiegenvector Centrality Across LRH-1 Primary Structure. The x-axis plots the physical sequence (or, primary structure) of each residue of the protein. The y-axis plots the eigenvector centrality value for each residue in the sequence. On visual inspection it seems reasonable that these these protein systems may be clustered into two groups. It also seems reasonable that principal components analysis PCA)is a reasonable approach for investigating what features are driving this variance.



Secondary Protein Structures as Features¶


As indicated in the illustration below, the LRH-1 secondary structures were identified. Each of the structures were assigned an average eigenvector centrality value based on the eigenvector values of each residue in the structure. This was performed for each of the 18 LRH-1 structures under consideration.

ECSecStruct.png

Secondary Structure as Features. The LRH-1 protein is a sequence of 242 residues. To purpose of this PCA is to find some measure of variation in the eigenvector values that might reproduce the visual grouping of the 18 structures. The approach we adopted was to identify the primary structure (or, sequence) associated with each of the secondary structures and average the eigenvector values for each residue within each secondary structure. This figure illustrates how each secondary structure (loops, helices, and sheets) was treated as a group of residues. Here the protein structures associated with each PDB file are the targets and the secondary structures are the features (See table below).



Protein Structures as Targets (rows) and Secondary Structures as Features (columns)¶


In [53]:
#Load the table as a Pandas dataframe.
"""Note: That the 'pdb_file' entries contain the list of 'targets' used in the PCA described below.
   The column names are the list of secondary structures of LRH-1. Each (i,j) entry is the average EC value 
   for the secondary structure associated with the PDB file of that row.""" 

df=pd.read_csv('/Users/davidfoutch/Desktop/David/secondary_structure_pca.txt',sep='\t',header=0)
df.style.set_table_attributes('style="font-size: 16px"')
Out[53]:
  pdb_file helix1 helix2 loop1 helix3 helix4 helix5 sheet1 loop2 sheet2 helix6 helix7 helix8 loop3 helix9 loop4 helix10 helix11 loop5 helix12
0 1yok 0.000576 0.003654 0.000314 0.031029 0.001018 0.052709 0.150762 0.053812 0.060763 0.026101 0.042471 0.004820 0.000790 0.000157 0.000023 0.000498 -0.029088 0.004879 0.061481
1 1yuc 0.000743 0.012840 0.000890 0.057354 0.001669 0.060717 0.056455 0.076121 0.018289 0.008388 0.008053 0.005091 0.003343 0.000391 0.000141 0.001046 -0.048620 0.027062 0.039156
2 1zdu 0.000202 0.007947 0.000460 0.050669 0.002127 0.050298 0.039572 0.072325 0.017996 0.018922 0.018822 0.002609 0.001172 0.000081 0.000011 0.000377 -0.050530 0.020620 0.072636
3 3plz 0.001271 0.015656 0.000000 0.055139 0.004365 0.054102 0.127785 0.116407 0.011096 0.025060 0.021333 0.004178 0.000761 0.001090 0.000129 0.000836 -0.015597 0.023692 0.037046
4 3tx7 0.000625 0.005513 0.000427 0.044520 0.000000 0.043169 0.104700 0.075048 0.072745 0.061702 0.030444 0.001431 0.000022 0.000218 0.000203 0.000574 -0.025690 0.020532 0.054883
5 4dor 0.002067 0.010955 0.001888 0.067321 0.004443 0.058594 0.054307 0.080375 0.018431 0.016435 0.029738 0.007632 0.000997 0.000904 0.000169 0.001584 -0.028626 0.028174 0.040670
6 4dos 0.000402 0.028617 0.005939 0.067000 0.000911 0.038745 0.059149 0.065611 0.060590 0.048827 0.022908 0.002705 0.000157 0.000440 0.000227 0.001477 0.034294 0.036000 0.061556
7 4oni 0.000312 0.008313 0.000734 0.033683 0.000433 0.057376 0.145752 0.071987 0.021481 0.015991 0.030568 0.001800 0.001175 0.000278 0.000082 0.000854 -0.044903 0.017223 0.018641
8 4pld 0.019235 0.030206 0.007326 0.081821 0.029528 0.085718 0.076756 0.061078 0.019302 0.014691 0.054611 0.056935 0.008931 0.023157 0.004250 0.023564 -0.070372 0.014760 0.064728
9 4ple 0.001660 0.002323 0.001876 0.045064 0.001737 0.046692 0.049928 0.015369 0.048707 0.063348 0.031489 0.005911 0.001757 0.000692 0.000124 0.000884 -0.046582 0.030244 0.069936
10 4rwv 0.000914 0.003979 0.000283 0.043673 0.001150 0.043446 0.059859 0.053475 0.037659 0.042981 0.027650 0.003051 0.001014 0.000204 0.000038 0.000390 -0.053159 0.010822 0.074276
11 5l11 0.015617 0.033968 0.012121 0.087246 0.020994 0.090050 0.045139 0.044618 0.016428 0.025267 0.059047 0.058966 0.012628 0.021853 0.007340 0.035950 -0.058355 0.012441 0.040010
12 5syz 0.014761 0.029265 0.010711 0.085142 0.024236 0.094928 0.075324 0.050406 0.017226 0.017767 0.050440 0.057061 0.008488 0.020446 0.007410 0.032384 -0.050732 0.012630 0.053846
13 5unj 0.018808 0.020364 0.001765 0.096289 0.025358 0.088195 0.041148 0.051062 0.011595 0.021755 0.047324 0.058565 0.003736 0.020208 0.002930 0.021128 -0.039532 0.023337 0.063663
14 6oqx 0.026081 0.020759 0.005134 0.083458 0.041482 0.096881 0.030195 0.034281 0.010749 0.018421 0.047028 0.086320 0.018635 0.038344 0.009920 0.045036 -0.053163 0.012589 0.053940
15 6oqy 0.013663 0.026514 0.004330 0.087202 0.020875 0.086448 0.064650 0.067561 0.023111 0.030738 0.052478 0.051809 0.011350 0.018312 0.005780 0.027102 -0.058918 0.011047 0.057732
16 6or1 0.032763 0.021224 0.003412 0.079031 0.029667 0.083683 0.027241 0.040638 0.010503 0.018645 0.044225 0.101694 0.011849 0.046109 0.012700 0.056007 -0.060972 0.011454 0.056959
17 6vc2 0.000540 0.015342 0.000711 0.044759 0.001578 0.053363 0.145899 0.115398 0.021781 0.017204 0.018100 0.004535 0.000440 0.000243 0.000037 0.000387 -0.006567 0.007745 0.010284

Table of Targets and Features. The rows in the table above represents the targets for the PCA, which are the individual LRH-1 structures. The columns represent features, which are the secondary structures. The values in the table are the averaged eigenvector centrality values for each secondary target.



Inspecting PC1 and PC2¶


In [54]:
df=pd.read_csv('/Users/davidfoutch/Desktop/David/transpose_for_pca.txt',sep='\t',header=0)
df=df.drop(['seq_pos','seq_id','sec_struct'], axis=1)


scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

pca = PCA(n_components=2)
PC_scores = pd.DataFrame(pca.fit_transform(df_scaled),
               columns = ['PC 1', 'PC 2'])
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], 
                        index=list(df.columns.values))
loadings.style.set_table_attributes('style="font-size: 16px"')
Out[54]:
  PC1 PC2
1yok 0.230914 -0.181243
1yuc 0.250479 -0.148688
1zdu 0.252150 -0.204559
3plz 0.250873 -0.165109
3tx7 0.228274 -0.262484
4dor 0.262657 -0.129846
4dos 0.024133 0.314485
4oni 0.234650 -0.186726
4pld 0.260003 0.169378
4ple 0.218371 -0.232409
4rwv 0.231815 -0.252638
5l11 0.253263 0.244395
5syz 0.258122 0.243177
5unj 0.254570 0.215808
6oqx 0.217383 0.383928
6oqy 0.273277 0.157543
6or1 0.208507 0.402805
6vc2 0.225098 -0.099870

Generating the PCA Two Component Biplot¶


In [25]:
# The 'targets' are the protein structures represented by the four character name of each PDB file
# and the 'features' are the secondary structures (helices, strands, sheets, loops) associated with LRH-1

targets = ['1yok','1yuc','1zdu','3plz','3tx7','4dor','4dos','4oni','4pld','4ple','4rwv',
    '5l11','5syz','5unj','6oqx','6oqy','6or1','6vc2']

features = ['helix1', 'helix2', 'loop1', 'helix3', 'helix4', 'helix5', 'sheet1',
       'loop2', 'sheet2', 'helix6', 'helix7', 'helix8', 'loop3', 'helix9',
       'loop4', 'helix10', 'helix11', 'loop5', 'helix12']

df=pd.read_csv('/Users/davidfoutch/Desktop/David/secondary_structure_pca.txt',sep='\t',header=0)
df2=df.drop(columns=['pdb_file'])

x = df2.loc[:,features].values
y = df['pdb_file']
y = y.to_frame()


x = StandardScaler().fit_transform(x)

pca = PCA(n_components=2)

principalComponents = pca.fit_transform(x)

principalDF = pd.DataFrame(data = principalComponents, 
                           columns = ['PC 1', 'PC 2'])

finalDF = pd.concat([principalDF, y], axis=1)

scalePC1 = 1.0/(finalDF['PC 1'].max() - finalDF['PC 1'].min())
scalePC2 = 1.0/(finalDF['PC 2'].max() - finalDF['PC 2'].min())


ldngs = pca.components_
featuresList = list(df2.columns.values)

fig, ax = plt.subplots(1,1,figsize=(16,9))

# ax = fig.add_subplot(1,1,1) 

ldngs = pca.components_
featuresList = list(df2.columns.values)

for i, feature in enumerate(featuresList):
    ax.arrow(0, 0, ldngs[0, i], 
             ldngs[1, i], alpha=0.1)
    ax.text(ldngs[0, i] * 1.0, 
            ldngs[1, i] * 1.0, 
            feature, fontsize=12, alpha=0.5)

ax.set_xlabel('Principal Component 1', fontsize = 16)
ax.set_ylabel('Principal Component 2', fontsize = 16)
ax.set_title('Two Component Biplot PCA', fontsize = 20)



set9 = finalDF[finalDF['pdb_file'].isin(['1yok','1yuc','1zdu','3tx7','3plz','4dor', '4dos','4oni','4ple','4rwv','6vc2'])]

set10 = finalDF[finalDF['pdb_file'].isin(['4pld','5l11','5syz','5unj','6oqx','6oqy','6or1'])]



ax.scatter(set9['PC 1']*scalePC1, set9['PC 2']*scalePC2,color='red',marker='x',label='Small molecule')
ax.scatter(set10['PC 1']*scalePC1, set10['PC 2']*scalePC2,color='blue',label='Phospholipid')

ax.legend(loc='upper right',fancybox=True,fontsize=14)
ax.grid()    

for i in range(11):
    plt.text(set9.iloc[i,0]*scalePC1, set9.iloc[i,1]*scalePC2+0.01, set9.iloc[i,2], color='red',fontsize = 14)

for i in range(7):
    plt.text(set10.iloc[i,0]*scalePC1, set10.iloc[i,1]*scalePC2-0.03, set10.iloc[i,2], color='blue',fontsize = 14)  
    
# plt.savefig('/Users/davidfoutch/Desktop/David/new_two_component_biplot_pca.png') 
No description has been provided for this image

Review of the Result and Interpretation¶



There are two key results that offer insight into why these two groups of proteins may be sorted by their energetic equilibrium, or $\Delta\Delta$G (see Fig 2). First, when the average eigenvector centrality of each secondary structure (helices, sheets, and loops) for each protein was used as features in the principal components analysis (PCA) the low $\Delta\Delta$G and the high $\Delta\Delta$G proteins clustereded nicely into two groups (see Fig 2A). What this indicates is that there is a set of secondary structures that are creating the variance between the equilibrium states. The bi-plot of the PCA indicated that helix 6 and helix 3 are significantfactors to understanding this variance between the two groups. Second, on inspection of the set of edges unique to the low $\Delta\Delta$G PSNs and high $\Delta\Delta$G PSNs. It is observed that there is a greater degree of connectivity between helices 3 and 6 at the enzymatic site indicating that the opening is more restricted requiring greater energy to achieve small molecule binding to reach energetic equilibrium (see Fig 2B).

solution.png

In [ ]: